Skip to content

KB Document metadata #6024

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
May 29, 2024
Merged

KB Document metadata #6024

merged 8 commits into from
May 29, 2024

Conversation

smithellis
Copy link
Contributor

@smithellis smithellis commented May 21, 2024

Enable meta data for KB articles and KB article listing pages:

  • Add row below title that indicates product, last updated date, and number of helpful votes
  • Various scss changes to support the layout
  • I implemented the django_jinja.contrib._humanize app in order to give us friendly last updated times instead of reinventing the wheel.
  • Changed view.py and facets.py files to support passing data needed
Screenshot 2024-05-21 at 7 58 59 PM

seScreenshot 2024-05-23 094456

@smithellis smithellis requested a review from escattone May 21, 2024 20:03
Copy link
Contributor

@escattone escattone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still reviewing, but wanted to make an early comment. Could you provide a screenshot of the desired metadata and its layout? I was given the following shot from Figma, which doesn't include the product but instead includes the read time, but maybe that's the wrong one?

image

@smithellis smithellis force-pushed the document-metadata branch from 6a85254 to 4cb8191 Compare May 22, 2024 00:15
@smithellis smithellis force-pushed the document-metadata branch 3 times, most recently from 2566ba9 to 9841862 Compare May 22, 2024 13:38
Add django.contrib.humanize
Update scss
Update view to pass `helpful_votes` and `product`
Update `document_meta` block in `document_macros.html`
Include `document_meta` in the `content_block`
@smithellis smithellis force-pushed the document-metadata branch 2 times, most recently from 20d216d to 639915c Compare May 23, 2024 13:57
- Modify `product/documents.html` to display metadata
- Modify scss to support layout needed
- Modify `facets.py` to return metadata needed
@smithellis smithellis force-pushed the document-metadata branch from 639915c to acbf68a Compare May 23, 2024 14:56
@smithellis smithellis requested a review from escattone May 23, 2024 15:37
Copy link
Contributor

@escattone escattone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only other thing I'd consider -- in addition to my comments below -- is that when you release this code -- which has a new data structure for the votes_dict cached under the votes_cache_key -- the prior cached value will clash with your current code. It might be worth considering how to handle that so it doesn't cause errors during the period until the cached content expires.

)
.values("revision_id")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can't remove .values("revision_id") because it's required to group the later count annotations by revision, in other words, to generate vote counts per revision. It's a subtle thing with Django Querysets, but that's what values() does when used before an aggregation (see https://docs.djangoproject.com/en/5.0/topics/db/aggregation/#values).

Without it, the count annotations simply count the votes per vote, which doesn't make sense.

For example, within my local dev environment, without the .values("revision_id") clause, I get the following votes_query rows (notice mulitple rows with the same revision_id):

{'revision_id': 23, 'helpful_count': 0, 'total_count': 1, 'time_limited_count': 0}
{'revision_id': 23, 'helpful_count': 1, 'total_count': 1, 'time_limited_count': 1}
{'revision_id': 5, 'helpful_count': 1, 'total_count': 1, 'time_limited_count': 1}
{'revision_id': 18, 'helpful_count': 1, 'total_count': 1, 'time_limited_count': 1}
{'revision_id': 1, 'helpful_count': 1, 'total_count': 1, 'time_limited_count': 1}
{'revision_id': 18, 'helpful_count': 0, 'total_count': 1, 'time_limited_count': 0}

but with the .values("revision_id") clause included I get:

{'revision_id': 1, 'helpful_count': 1, 'total_count': 1, 'time_limited_count': 1}
{'revision_id': 5, 'helpful_count': 1, 'total_count': 1, 'time_limited_count': 1}
{'revision_id': 18, 'helpful_count': 1, 'total_count': 2, 'time_limited_count': 1}
{'revision_id': 23, 'helpful_count': 1, 'total_count': 2, 'time_limited_count': 1}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great catch. I literally thought it was redundant, but now I understand its role. Thanks for this.

.annotate(count=Count("*"))
.values("revision_id", "count")
.annotate(
helpful_count=Count(Case(When(helpful=True, then=1), output_field=IntegerField())),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're just counting a specific condition (when helpful=True), Django provides a simpler way to do this:

helpful_count=Count("id", filter=Q(helpful=True)),

Comment on lines 133 to 142
time_limited_count=Count(
Case(
When(
created__range=(Now() - timedelta(days=30), Now()),
helpful=True,
then=1,
),
output_field=IntegerField(),
)
),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly here, this can be simplified to:

time_limited_count=Count("id", filter=Q(helpful=True, created__range=(Now() - timedelta(days=30), Now())))

Comment on lines 149 to 153
row["revision_id"]: {
"helpful_count": row["helpful_count"],
"total_count": row["total_count"],
"time_limited_count": row["time_limited_count"],
}
Copy link
Contributor

@escattone escattone May 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you're going to retrieve all 3 of these values later on, I would recommend simply storing the 3 values as a tuple. Also, a tuple is much more memory efficient -- it uses more than 50% less memory in this case -- and the votes_dict will be quite large.

row["revision_id"]: (row["helpful_count"], row["total_count"], row["time_limited_count"])

and then later you can just retrieve them all like:

total_helpful_votes, total_votes, time_limited_helpful_votes = votes_dict.get(d.current_revision_id, (0, 0, 0))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another great catch. Thanks.

Comment on lines 163 to 164
total_votes = votes_dict.get(d.current_revision_id, {}).get("total_count", 0)
total_helpful_votes = votes_dict.get(d.current_revision_id, {}).get("helpful_count", 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To match my comment above:

total_helpful_votes, total_votes, time_limited_helpful_votes = votes_dict.get(d.current_revision_id, (0, 0, 0))

Comment on lines 176 to 178
helpful_votes=votes_dict.get(d.current_revision_id, {}).get(
"time_limited_count", 0
),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, to match the comments above:

helpful_votes=time_limited_helpful_votes,

votes_query = (
HelpfulVote.objects.filter(
revision_id__in=qs.values_list("current_revision_id", flat=True),
created__range=(datetime.now() - timedelta(days=30), Now()),
Copy link
Contributor

@escattone escattone May 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing the created__range filter here worries me, because it makes this votes_query much more expensive. The HelpfulVote table is one of our largest tables, if not the largest, with more than 56 million rows collected over 12 years, and growing.

When I ran the votes_query in production without the created__range filter, it took a few seconds (about 3-4 seconds) -- the query as it is today is practically instantaneous -- and that query time will grow as the table grows over time. That could be a problem when the cache is cold and the user has to wait several seconds for a fresh query to complete in order for the results to display on a products page.

We may need to compromise and limit the votes_query to only consider votes within the past 4-5 years, to avoid delayed page results for the cold cache case. When I tested the query in production with a created__range filter that limits the query to the past 5 years, it was still relatively fast (less than a second). Given that most current document revisions will probably be less than 4-5 years old and also given that restricting the query by created__range prevents the cost of the query from growing as the table grows, it might be best to use a fixed created__range.

If we do decide to restrict the query to the past 4-5 years, we should probably use the same limit when gathering the votes for the document view.

I think even if we decide against restricting the query to the past 4-5 years, we should at least restrict the query to something -- like maybe 10 years -- just to prevent the cost of the query growing over time as the HelpfulVote table grows.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, the current revision of more than a third of all of the active KB articles is older than 5 years.

In [37]: Document.objects.filter(current_revision__created__lte=date(2019, 1, 1), is_archived=False, is_template=Fa
    ...: lse).count()
Out[37]: 13696

In [38]: Document.objects.filter(current_revision__isnull=False, is_archived=False, is_template=False).count()
Out[38]: 37062

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could reduce the size of the HelpfulVote table by about 25% by removing votes on revisions of documents that have been archived.

In [40]: HelpfulVote.objects.filter(revision__document__is_archived=True).count()
Out[40]: 14709060

In [41]: HelpfulVote.objects.count()
Out[41]: 56457295

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth more discussion. I'm not sure what the right answer is in the short term - I can't produce a % of votes without the total votes. I've excluded archived documents in the query.
If this turns out to be too intensive we can look at batch processing for totals and storing them somewhere.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps for now, we shouldn't worry about it? We're already caching the results for 24 hours, so only a single request every 24 hours may experience the "cold cache" delay.

@@ -271,12 +271,22 @@ def document(request, document_slug, document=None):
breadcrumbs.append((product.get_absolute_url(), product.title))
# The list above was built backwards, so flip this.
breadcrumbs.reverse()
votes = HelpfulVote.objects.filter(revision=doc.current_revision).aggregate(
total_votes=Count("id"),
helpful_votes=Sum(Case(When(helpful=True, then=1), default=0, output_field=FloatField())),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed this in my first review, but this can be simplified too:

        helpful_votes=Count("id", filter=Q(helpful=True)),

Comment on lines 118 to 128
<div id="document_metadata" class="has-border-top">
<span class="product">{% for product in document['products'] %}{{ product }}{% endfor %}</span>
<span class="last-updated">
<img class="pencil" src="{{ webpack_static('sumo/img/pencil.svg') }}" /> <strong>Last updated:</strong> <span class="time">{{ document['created']|naturaltime }}</span>
</span>
{% if document['percent_helpful']>0 %}
<span class="helpful-info">
<img class="thumbsup" src="{{ webpack_static('sumo/img/thumbs-up.svg') }}"/><span class="helpful-count">{{ document['percent_helpful'] }}%</span> of users found this helpful
</span>
{% endif %}
</div>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use the document_metadata macro here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially, I had talked myself out of that since it lives in a different app. I have now altered it and implemented it.

created__range=(datetime.now() - timedelta(days=30), Now()),
helpful=True,
)
HelpfulVote.objects.filter(revision__document__is_archived=False)
Copy link
Contributor

@escattone escattone May 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this .filter(revision__document__is_archived=False) is necessary, given that we're already limiting the revisions to those within the qs and the qs is already filtering out the revisions of archived documents. It may even cause the votes_query to be a bit more costly since now it has to join to both the Revision and Document tables, so I would recommend removing this.

Copy link
Contributor

@escattone escattone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking that in addition to removing .filter(revision__document__is_archived=False) from the votes_query, you probably should retrieve/store the votes_dict from/in a different key -- maybe just version the key -- so that the old data in the cache doesn't cause failures for the initial period of up to 24 hours (the cache timeout) after the release of this new code. Maybe something like:

...
votes_cache_key = f"votes_for_v2:{cache_key}"
if votes_dict is None:
...

@@ -94,6 +96,7 @@
{% endif %}
<article class="wiki-doc">
{{ document_messages(document, redirected_from) }}
{{ document_metadata(document.current_revision.created, helpful_votes, products, metadata_type) }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit. Maybe use document_metadata(..., metadata_type="document") or document_metadata(..., "document") instead of having to define a separate Jinja variable with {% set metadata_type = 'document' %}?

@@ -115,6 +117,7 @@ <h2 class="sumo-card-heading">
</a>
</h2>
<p>{{ document['document_summary'] }}</p>
{{ document_metadata(document['created'], document['percent_helpful'], document['products'], metadata_type) }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice that you were able to use the macro here!

Nit. Maybe use document_metadata(..., metadata_type="list") or document_metadata(..., "list") instead of having to define a separate Jinja variable with {% set metadata_type = "list" %}?

@escattone escattone force-pushed the document-metadata branch 3 times, most recently from c6859dd to ed56a7c Compare May 29, 2024 02:19
@escattone escattone force-pushed the document-metadata branch from ed56a7c to 7b58ce9 Compare May 29, 2024 16:29
@escattone
Copy link
Contributor

escattone commented May 29, 2024

After a SuMO platform team discussion, we decided not to show the helpful votes metadata for the "list" of documents case in order to avoid the performance issue for now. We will show the helpful votes metadata for the single document case. We also decided to truncate the products metadata -- a comma-separated list of product titles associated with the document -- to 50 characters, although when the helpful votes metadata is included, I've reduced it to 25 characters in order to avoid overflow when the last updated metadata is long for cases like Last updated: 4 months, 1 week ago.

@escattone escattone merged commit 88ac3af into mozilla:main May 29, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants